Bilateral Personalized Dialogue Generation with Dynamic Persona-Aware Fusion
Generating personalized responses is one of the major challenges in natural
human-robot interaction. Current research in this field mainly focuses on
generating responses consistent with the robot's pre-assigned persona while
ignoring the user's persona. Such responses may be inappropriate or even
offensive, leading to a bad user experience. Therefore, we propose a
bilateral personalized dialogue generation (BPDG) method with dynamic
persona-aware fusion via multi-task transfer learning to generate responses
consistent with both personas. The proposed method aims to accomplish three
learning tasks: 1) an encoder is trained with dialogue utterances added with
corresponding personalized attributes and relative position (language model
task), 2) a dynamic persona-aware fusion module predicts the persona presence
to adaptively fuse the contextual and bilateral persona encodings (persona
prediction task) and 3) a decoder generates natural, fluent and personalized
responses (dialogue generation task). To make the generated responses more
personalized and bilateral persona-consistent, the Conditional Mutual
Information Maximum (CMIM) criterion is adopted to select the final response
from the generated candidates. The experimental results show that the proposed
method outperforms several state-of-the-art methods in terms of both automatic
and manual evaluations.
Comment: 14 pages, 6 figures
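To make the fusion idea concrete, the following is a minimal sketch of a dynamic persona-aware gate that predicts persona presence and adaptively fuses the contextual encoding with the two (robot and user) persona encodings. The module name, dimensions, and gating form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicPersonaFusion(nn.Module):
    """Sketch: predict persona presence and gate the fusion of the
    contextual and bilateral (robot/user) persona encodings."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Presence predictor: one weight each for (context, robot, user).
        self.presence = nn.Linear(3 * d_model, 3)

    def forward(self, ctx, robot_persona, user_persona):
        # Each input: (batch, d_model) pooled encoding.
        concat = torch.cat([ctx, robot_persona, user_persona], dim=-1)
        weights = torch.softmax(self.presence(concat), dim=-1)  # (batch, 3)
        stacked = torch.stack([ctx, robot_persona, user_persona], dim=1)
        # Weighted sum gives the fused encoding fed to the decoder.
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)

fusion = DynamicPersonaFusion()
fused = fusion(torch.randn(2, 256), torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```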
Robust Facial Expression Recognition with Convolutional Visual Transformers
Facial Expression Recognition (FER) in the wild is extremely challenging due
to occlusions, varying head poses, face deformation, and motion blur under
unconstrained conditions. Although substantial progress has been made in
automatic FER in the past few decades, previous studies were mainly designed for
lab-controlled FER. Real-world occlusions, varying head poses, and other issues
markedly increase the difficulty of FER because of information-deficient
regions and complex backgrounds. Unlike previous pure CNN-based methods, we
argue that it is feasible and practical to
translate facial images into sequences of visual words and perform expression
recognition from a global perspective. Therefore, we propose Convolutional
Visual Transformers to tackle FER in the wild by two main steps. First, we
propose an attentional selective fusion (ASF) for leveraging the feature maps
generated by two-branch CNNs. The ASF captures discriminative information by
fusing multiple features with global-local attention. The fused feature maps
are then flattened and projected into sequences of visual words. Second,
inspired by the success of Transformers in natural language processing, we
propose to model relationships between these visual words with global
self-attention. The proposed method is evaluated on three public in-the-wild
facial expression datasets (RAF-DB, FERPlus and AffectNet). Under the same
settings, extensive experiments demonstrate that our method shows superior
performance over other methods, setting new state of the art on RAF-DB with
88.14%, FERPlus with 88.81%, and AffectNet with 61.85%. We also conduct a
cross-dataset evaluation on CK+ to show the generalization capability of the
proposed method.
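As a rough illustration of the two steps, the sketch below fuses two CNN feature maps with a learned per-location attention gate (standing in for ASF), flattens the result into a sequence of visual words, and models them with global self-attention. Channel counts, the gate form, and the transformer configuration are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttentionalSelectiveFusion(nn.Module):
    """Sketch of ASF-style fusion: a 1x1 conv predicts a per-location
    gate that blends the two branch feature maps."""
    def __init__(self, channels: int = 512):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, C, H, W) maps from the two-branch CNNs.
        alpha = torch.sigmoid(self.gate(torch.cat([feat_a, feat_b], dim=1)))
        return alpha * feat_a + (1 - alpha) * feat_b

# Flatten the fused map into "visual words", then apply self-attention.
fuse = AttentionalSelectiveFusion(channels=512)
fused = fuse(torch.randn(2, 512, 7, 7), torch.randn(2, 512, 7, 7))
tokens = fused.flatten(2).transpose(1, 2)             # (batch, 49, 512)
encoder = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
out = encoder(tokens)                                  # global self-attention
print(out.shape)  # torch.Size([2, 49, 512])
```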
Learning to Locate Visual Answer in Video Corpus Using Question
We introduce a new task, named video corpus visual answer localization
(VCVAL), which aims to locate the visual answer in a large collection of
untrimmed instructional videos using a natural language question. This task
requires a range of skills: the interaction between vision and language, video
retrieval, passage comprehension, and visual answer localization. In this
paper, we propose a cross-modal contrastive global-span (CCGS) method for the
VCVAL, jointly training the video corpus retrieval and visual answer
localization subtasks with the global-span matrix. We have reconstructed a
dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental
results show that the proposed method outperforms other competitive methods
both in the video corpus retrieval and visual answer localization subtasks.
Most importantly, we perform detailed analyses of extensive experiments, paving
a new path for understanding instructional videos and opening the way for
further research.
Comment: 4 pages, 2 figures, and 3 tables
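To give a feel for the retrieval half of the pipeline, here is a minimal sketch of contrastive cross-modal scoring: each video in the corpus is ranked against a question embedding by temperature-scaled cosine similarity. The embedding dimension, temperature, and function name are illustrative assumptions, not the CCGS implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_scores(question_emb, video_embs, temperature=0.07):
    """Sketch of cross-modal contrastive video retrieval: score every
    video embedding in the corpus against one question embedding."""
    q = F.normalize(question_emb, dim=-1)   # (d,)
    v = F.normalize(video_embs, dim=-1)     # (num_videos, d)
    return (v @ q) / temperature            # higher = better match

# Toy usage: rank a corpus of 5 videos for one question.
scores = contrastive_retrieval_scores(torch.randn(128), torch.randn(5, 128))
best = scores.argmax().item()
print(f"top-ranked video index: {best}")
```

The localization subtask would then operate within the top-ranked video, which is consistent with the joint retrieval-plus-localization training the abstract describes.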